CoDoSA: A Lightweight, XML-Based Framework for Integrating Unstructured Textual Information
نویسندگان
چکیده
One of the most fundamental dimensions of information quality is access. For many organizations, a large part of their information assets is locked away in Unstructured Textual Information (UTI) in the form of email, letters, contracts, call notes, and spreadsheet. In addition to internal UTI, there is also a wealth of publicly available UTI on websites, in newspapers, courthouse records and other sources that can add value when combined with internally managed information. This paper describes a system called Compressed Document Set Architecture (CoDoSA) designed to facilitate the integration of UTI into a structured database environment where it can be more readily accessed and manipulated. The CoDoSA Framework comprises an XML-based metadata standard and an associated Application Program Interface (API). It further describes how CoDoSA can facilitate the storage and management of information during the ETL (Extract, Transform, and Load) process to integrate unstructured UTI information. It also explains how CoDoSA promotes higher information quality by providing several features that simplify the governance of metadata standards and enforcement of data quality constraints across different UTI applications and development teams. In addition, CoDoSA provides a mechanism for inserting semantic tags into captured UTI, tags that can be used in later steps to drive semantic-mediated queries and processes.
منابع مشابه
Apply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML
As moving to big data world where data is increasing in unstructured way with high velocity, there is a need of data-store to store this bundle amount of data. Traditionally, relational databases are used which are now not compatible to handle this large amount of data, so it is needed to move on to non-relational data-stores. In the current study, we have proposed an extension of the Mongo...
متن کاملApply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML
As moving to big data world where data is increasing in unstructured way with high velocity, there is a need of data-store to store this bundle amount of data. Traditionally, relational databases are used which are now not compatible to handle this large amount of data, so it is needed to move on to non-relational data-stores. In the current study, we have proposed an extension of the Mongo...
متن کاملTEXTUAL AND INTER-TEXTUAL ANALYSES OF IRANIAN EFL UNDERGRADUATES’ TYPES OF ENGLISH READING TOWARDS DEVELOPING A CAREFUL READING FRAMEWORK
This study investigated textual and inter-textual reading of a group of Iranian EFL undergraduates’ careful English reading types. In this research, Khalifa and Weir’s (2009) reading framework was used to propose a more inclusive aspect of a careful reading framework and the reading construct for instructional and assessment goals. The participants of this study were B.A. students of English Tr...
متن کاملA framework for XML similarity joins
A prime motivation for using XML to directly represent pieces of information is the ability of supporting ad-hoc or “schema-later” settings. In such scenarios, modeling data under loose data constraints is essential. Of course, the flexibility of XML comes at a price: the absence of a rigid, regular, and homogeneous structure makes many aspects of data management more challenging. Such malleabl...
متن کاملShallow, Deep and Hybrid Processing with UIMA and Heart of Gold
The Unstructured Information Management Architecture (UIMA) is a generic platform for processing text and other unstructured, human-generated data. For text, it has been proposed and is being used mainly for shallow natural language processing (NLP) tasks such as part-of-speech tagging, chunking, named entity recognition and shallow parsing. However, it is commonly accepted that getting interes...
متن کامل